A GIBBS SAMPLER ON THE N-SIMPLEX 

aaron smith 

1. Abstract 

,_H We determine the mixing time of a simple Gibbs sampler on the unit simplex, 

^~~* confirming a conjecture of D. Aldous. The upper bound is based on a two-step 

z^ coupling, where the first step is a simple contraction argument and the second step is a 

,__! non-Markovian coupling. We also present a MCMC-based perfect sampling algorithm 

^ that is based on our proof and which can be applied to Gibbs samplers that are harder 

to analyze. 
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2. Introduction 



Pm Given a measure /i on a convex body K C W^, how can we efficiently obtain 

'i independent samples from the distribution of /i? This problem arises in the computa- 

'-^ tional sciences, and a frequently-used tool is Markov chain Monte Carlo (MCMC) f6|. 

^ Because MCMC methods produce nearly-independent samples only after a lengthy 

^ tt^ mixing period, a long-standing mathematical question is to analyze the mixing times 
of the MCMC algorithms which are in common use. 

T-H 

Q>^ The analysis of discrete MCMC algorithms is very advanced, with precise bounds 

CN for many difficult problems as well as some general theory [o] [2]. For continuous 

(^ run into technical difficulties. There are also very few well-understood simple chains, 

^—1 in stark contrast to the discrete theory, which has been built on many detailed anal- 



samplers, there has been some general theory based on geometric or coupling argu- 
ments 11 10 16 15 , but many of the techniques built for discrete chains seem to 



yses of specific chains (though see 13 14 for some very nice analyses of some slower 



walks on the simplex). This paper is an attempt to carefully analyze a simple con- 

^ tinuous chain, a Gibbs sampler on the n-simplex, and to illustrate the use of two 

powerful techniques from the discrete case, non-Markovian couplings [s] [i] [s] and 



coupling from the past 12 



Throughout most of this paper, we will be concerned with a Gibbs sampler Xt 
having the uniform distribution on the n-simplex A„ = {X\ Y17=i A, = 1; Aj > 0} as 
its stationary distribution. To take a move in this Markov chain, choose 1 < i < j <n 
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and A G [0, 1] uniformly, then set Xt+i[i] = X{Xt[i]+ Xt[j]), Xt+i[j] = (1 - X){Xt[i] + 
Xt[j]) and Xt_,_i[A;] = Xt[k] for k ^ i,j. This sampler was first mentioned in [2], 
where the mixing time was shown to be O {n^ log{n)) . D. Aldous suggested |l] that 
the correct mixing time was 0{nlog{n)), and we confirm this: 

Theorem 1 (Simplex Mixing Time). // K^ is the t-step transition kernel for the 
Gibbs sampler described above, and Un is the uniform distribution on A„, then for 
t>7{C + 6.5)nlog(n) and C >0, 

\\K\x,y)-U{x)\\ <8n~^ 

for all X G A„. 

After proving this lemma, we will briefly discuss how to turn the proof into a per- 
fect sampling algorithm, and mention directions for further work. 



3. Notation, Basic Lemmas and Strategy 

This proof relies on the following standard lemma (see P^): 

Lemma 2 (Fundamental Coupling Lemma). If Xt, Yt are two coupled instances of 
the same Markov chain, Yq is distributed according to the stationary distribution of the 
Markov chain, and r is the first time at which Xf = Yt, then dTvi^t, Yf) < P[t > t] 

Throughout this note, we are interested in two Markov chains, Xt and Yj, where 
Xt starts according to some distribution of our choosing and Yt starts out uniformly 
over the simplex. We will describe a joint evolution of our two chains Xt and Yt, 
such that at a specific time, the probability of having coupled is very high. The 
method for proving this is slightly unusual. In most coupling proofs, including the 
non-Markovian coupling in M , there is an attempt to make the two chains get closer 
throughout the process. In our method, we attempt to couple only at a specific final 
time, and include many moves that are likely to increase the distance between the 
chains by a large amount. In fact, our joint distribution will generally assign prob- 
ability to coupling at any prior time. 

Throughout the document, I try to be consistent about notation. Here is a list of 
some commonly used variables that I have tried to reserve, for reference while reading: 

Xt'. The Markov chain of interest. 

Yt'. Another instance of the Markov chain, started at stationarity. 

P{t): A partition of [n]. 
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S: A piece of a partition. 

i,j: Coordinates we update. 

A, Xx, Xy'- Uniform random variable used to update a chain, or chains Xt and Yt. 

w{S,x): The weight assigned by vector a; to a subset S C [n]. 

b, c, d, e: Exponents used to bound the size of certain often-used quantities, b 
is used to bound infjl^[i], c is used as a placeholder for comparing the scale 
of two quantities, d is used to bound -E[||X( — Vt||2], and e is used to bound 
supi \Xt[i] -Yt[{\\. 

In order to develop our global joint coupling, we describe two possible one-step cou- 
plings of Xf and Yf. These are the 'proportional' coupling and the 'subset' coupling. 
Throughout, we will always choose to update entries at the same coordinates i,j in 
both Xf and Yt at every step; only the uniform variable A sometimes differs. Because 
of this, we often use these couplings to refer to actual one-step couplings as well as 
couplings of A conditioned on the choice of update coordinates. Before defining the 
couplings, we need the following (standard) technical lemma. 

Lemma 3 (Coupling Existence). Let f{t) g(t) be two probability density functions 
on [0, 1], and let < a < 1 be a real number s.t. ag{t) < f{t) for all t. Then it 
is possible to define random variables {X,Y) s.t. X = Y with probability at least a 
and X is marginally distributed according to f{t) while Y is marginally distributed 
according to g{t). 

Proof: Let r(t) = ^^(/(t) - agit)) > 0; we note that /Jr(t) = x^(/,/(t) - 
aj^g{t)) = 1, and so r(t) is the density of a distribution. Now we choose Y ac- 
cording to the distribution g{t), and we choose X = Y with probability a, and 
choose X according to the distribution r{t) with probability 1 — a. It is clear that 
X = Y with probability at least a, and that Y has the correct distribution. To 
see that X has the correct distribution, note that for any set A, P[X & A] = 
<^lA9{t) + (!-«) Xi jh^ifit) - ag{t)) = j\f{t), as we wanted. 

In the proportional coupling, we choose an i, j and A for 1^, and then use the same 
choices for X^, so that e.g. entry i in Yt is updated to A(Ft[i] -|- Yt[j]) while entry i in 
Xt is updated to A(X([i] -|- X([j]). To define the 'subset' coupling, we first define the 
weight w{S, x) that a vector x gives to a subset S C [n] to be w{S, x) = J2ses ^«- "^^^ 
subset coupling of Xt and Yt with respect to a subset S* is a coupling that maximizes 
the probability that w{S, Xt+i) = w{S, Vt+i), and we say that a given step of a subset 
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coupling succeeds if that equality holds at time t + 1, and fails otherwise. Throughout, 
we generally care about whether such a coupling succeeds, not what happens when it 
succeeds or fails. For that reason, we don't pay attention to which optimal coupling 
we use, nor do we pay attention to what happens upon failure. 
For a pair of points {x,y) in the simplex, a pair of update entries {i,j), and a subset 

5 C [n] of interest such that i E S and j not in S, we define p{x,y,i,j, S) to be the 
subset coupling probability. Below, we give a lower bound for the success probability, 
which will be enough for us. Note that for some choices of x, y, i,j, S, the probability 
of success is under any coupling. 

Lemma 4 (Subset Coupling). For a pair of vectors {x,y) satisfying sup^ \xi — yi\ < 
n~^ and infjXj, infj|/j > n~^, for e > h, we have for all sufficiently large n that 
p{x,y,i,j,S) > 1 — 2n*'^^"^ uniformly in S and possible i,j. 

Proof: We consider two elements of the simplex, {xi, . . . ,Xn) and (yi, . . . ,y„). 
Assume we are interested in coupling the weights of subset S* = J U (1) C [n], and 
that coordinates 1 and j, neither in J, are being updated. We are updating vectors 
X, y with the random variables A^;, Xy and would like to find the probability that they 
can be coupled. We note that the balance condition is exactly 

(1) K{xi + Xj) + ^Xi = Xyiyi + yj) + ^yi 

which we can restate as 

(2) K = x/-^ + ^j:(y 



Ju \ I Ju j JU ]_ I JU 'j 



Xi 



yi+Vj 



Assume for now that ^^,J, > 1. Then, by the coupling existence lemma, a valid 
coupling is to choose Xy according to the uniform distribution on (0, 1), and to choose 
Xx = Xy^-^^ + ^ ]_^ X^ie/d/i ~ ^i)^ ^^ loiig s-s that is a value between and 1, and to 
choose from some (unimportant) remainder distribution with the remaining probabil- 
ity. Let m = ^^ , ^^ and let 6 = — } — y^,v-r(% — Xi). Then integration tells us that this 

results in a successful coupling with probability inf (1, ^) — sup(0, — ^). If fqif^ < 1, 
then we choose Xx first and then choose A^ according to the equivalent formula; this 
succeeds with probability inf(l, 1 + 6) — sup(0,(5). Thus, if \m — 1|, \m~^ — 1| < ei 
and \6\ < €2, then the coupling probability is at least 1 — 2^^. 
Next, note that if supj |xj — yi\ < n~^ and infjXj, infj|/j > n^^, then we can take 
ei = n*"*^ and €2 = n^"*"^"^. This proves the lemma. 

We can now describe the overall strategy. In the first phase, we show that the 
proportional coupling is contractive, and prove that the two chains are very close 
with high probability after about n log(n) steps. In the next phase, we run Yt until a 
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specified time T. We use information about Yt from time to T to construct a nested 
sequence of partitions, starting with one set at time and normally ending with n 
singletons. We then show that, if the partitions are chosen correctly, it is possible to 
make sure that each piece of the partition has the same weight in both Xt and Yt for 
all times, with high probability. When the final partition consists of only singletons, 
this implies that the chains have coupled. The main difficulties here are constructing 
the partition, and then showing that the conditions required for the subset coupling 
lemma apply with high probability. 

It is worth pointing out that the dependence of the coupling on the future is in 
fact necessary to get the correct mixing time, or indeed any bound that is o{n'^). 
This is analogous to the well-known fact that no Markovian coupling of the random 
transposition walk on Sn can give a coupling time that is o(n^). 



4. First Coupling Stage 

We define Zt = \\Xt — Yt\\2- We then compute E[Zt\Xt-i,Yt-i] under the 'propor- 
tional coupling' described above. We have 



Lemma 5 (Burn-In). After t = ^dn\og{n) steps of the proportional coupling, E[Zt] < 
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Proof: We perform the following computation 

E 






2 4 

+ 7^{xj - yjf + -{xi - yi){xj - yj)] 



^^ 3(n-l) ^3n(n-l)^^*"' 

2 2 4 

^ 3(n-l) 3n(n-l)^ * ^ 3n(n - 1) * ^ 

2 2 

~ ^ ~ 3(n-l) ~ 3n(n-l)'' *"^ 

And so in particular, E[Zt] < (1 — ^)*^o < 2(1 — ^J^, and so at time S = |(inlog(n), 
-E[Zi] < 2n~'^, proving the lemma. 

By Markov's inequality, P[|Xi[i] — Yt[j]\ > e] < e^^n^2'^. As an aside, if we couple 
Xt and Yt by choosing the same coordinates i,j and uniform random variables A, A', 
one sees that the one-step coupling which minimizes the distance Zt described above 
occurs when £'[AA'] is maximized, and this maximum occurs at A = A'. 

5. Second Coupling Stage 

Assume that at time 0, Xq and Yq satisfy | |Xo — Vol I2 — "^ '^^ We describe a couphng 
from time to time T = (^ + e)nlog(n), which has a high chance of succeeding for 
any e > 0. First, we choose a sequence of pairs of distinct elements {i{t),j{t)) for time 
1 < t < T. From this sequence, we will construct a nested sequence of partitions of 
[n], -P(O) > -P(l) > . . . > P{T), where we say that partition A is less than partition 
B if every piece of partition A is contained in a piece of partition B. We will also 
construct a sequence of marked times ti, . . . ,tk, and a sequence of graphs Gt on the 
vertex set [n]. 
The first partition, P{T), consists of n singletons, our list of marked times is initially 
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empty, and the graph Gt has no edges. To construct P{t — 1) from P(t), we look at 
the edge {i{t — l),j(t — 1)). If it goes between elements in two parts of P(t), we join 
those two parts to make P{t — 1); otherwise, we set Pit — 1) = P(t). If two parts 
are joined at time t — 1, we call them 5'(t, 1) and Sit^ 2), and add time t to our list 
of marked times. Finally, at time t we add edge {i{t — I), jit — 1)) to the graph Gt 
if it wasn't already present, to create Gt~i- We make a few elementary observations. 
The first is that there are at most n — 1 marked times. The next is that there are 
n — 1 marked times if and only if P(0) = [n] which in turn occurs if and only if Gq is 
connected. 

The question of whether Gt is connected is classical. The following result, found [s] 
among other places, is good enough for our purposes: 

Lemma 6 (Connectedness). For T > (^ + e)n\og{n) and e > 0, the probability that 
Gq is greater than 1 — 2n~'^ for n sufficiently large. 

Having constructed this partition, we now couple Xj and Yt from time < t < T. 
First, we need to choose the coordinates to update; we do this by updating coordinates 
i{T — t) and j{T — t) at time t in both chains. Note that this is a time reversal with 
respect to the order we looked at coordinates when building the graph. Since our 
chain is reversible, this doesn't present a problem for our coupling. Next, we must 
describe the coupling of the coordinates. If T — t is a marked time, then we perform a 
subset coupling for the subset S{T — t, 1). Otherwise, we do a proportional coupling. 
I claim that this couples the two walks by time T with high probability. To prove 
this, we need the following three lemmas. 

Lemma 7 (Technical Lemma 1). n'i'=i(l + kj^^) < {k + l)n'^ 

Proof: This is a straightforward induction on n. Note that the statement is true 
for n = 1 and all k, and assume it is true for 1 < n < N. Then we note that 

N+l , N 

n (1 + kj-') = (1 + j^ x{{i + kj-') 

<{k + i)N'' + {e + k)N 

<{k + l){N + lf 
Proving the lemma. 

Lemma 8 (Technical Lemma 2). Let n = So > Si > . . . > Sk = m be a de- 
creasing sequence of integers, such that 5*^+1 > ^Sj. For such a sequence, define 

/(S'o, . . . , Sk) = ni=o(-'- + ~^~s^^^)' '^'^^ ^^^ F{n,m) be the supremum of f{So, . . . , Sk) 
over all such sequences. Then F{n, m) = f{n, n — 1, . . . ,m). Further, F{n, m) < 2n 
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Proof: We prove this by simultaneous induction on n, m. It is clear for all 

n that F{n,n — 1) = f{n,n — 1). Now fix an A^ and M, and assume that for 
all m > M F{N,m) = f{N,N — l,...,m) and that for all n < A^ and all m, 
f{n, m) = f{n, n-l,...,m). We will show that F{N, M) = f{N, N -1,...,M). 
To do so, assume this isn't the case. Then there is some other sequence A^ = 
So, . . . , Sk = M such that /(S'o, . . . , Sk) > f{N, . . . , M). From the induction hy- 
pothesis, this is impossible if 5*1 = n — 1, for if we did, we would have 

f{So,...,Sk) = {l + -)f{Su...,Sk) <(l + -)/(n-l,...,m) 

n n 

= f{n,...,m) 

Thus, we can assume 5*1 < n — 1. By the induction hypothesis, we must have 
Sj+i = Si — j for all j > 1, and so /(S'o, . . . , Sk) > f{N, . . . , M) implies 

(I 



M - n 1 ) 




^ n 1 '^^~^n 1 


1 


n n - 
n 


-Si + l 



Where the second line is due again to the induction hypothesis. This is a contradic- 
tion, and we conclude F{n, in) = f{n, n — 1, . . . , tti) for all n, m. 
To prove the last part, we note that 

F{n,m) < F{n,l) 

n 

< 2n 
where the last step is due to technical lemma 1, above. 

Lemma 9 (Largeness). P[infi<j<„info<i<T^t[^] — "^ '^'^] = ^(l) 
Proof: This calculation is done on pp. 11 of chapter 13 of [AF]. 

Lemma 10 (Smallness). Let A{s) be the set on which sup^\Xt[i] — Yt[i]\ < 2n'^ 
and infjXj, infjt/j > n~*^'^ for all < t < s, and call this 'condition A'. Then 
P[(supo<,<j ||X, - Y,\\l > n^\\Xo - ro||2) n A{t - 1)] < 2tn'^-'' for all c> 0. 

Proof: We will estimate -E[||^t — ^t||2] while conditioned on something slightly 
stronger than all subset couplings succeeding until time t, by bounding how much 
the term changes during each successful 'subset coupling'. At the same time, we will 
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estimate the probability that condition A fails at time t. To set notation, assume 
the subset couphngs happen at times ti, . . . ,tn-i between sets S(tk, 1) and S(tk,2) 
where \S{h,l)\ < \S{h,2)\. We write \\Xt - Yt\\ls = EsesiM^] ' ^^8])^ the L^ 
norm restricted to indices contained in the subset S. We let i{tk) G S(tk, 1) and 
j(tk) G S(tk,2) be the two coordinates updated at time tk- If the coupling at time 
tk was successful, then we have Xt^[i(tk)] = >^tj«tj + Zl«g5(tfe,i)(^tJ^] " ^tM)- ^^ 
particular, we can make the following computation. In it, the expectation is taken 
with Xti^_i, Ytf,_i, tk, the size oi S(tk, 1) (but not the set), the old partition P(tk) (but 
not the new partition P{tk + 1)) and the part of the partition p that is being broken 
in this step fixed; we are viewing the particular sets S{tk, 1) and S{tk,2), and the 
coordinates i{tk) and j{tk), as the random variables. We also condition on the update 
variable Xy at time t being in the range described in the 'subset coupling lemma' 
based on condition A (looking at that lemma shows that this is just conditioning on 
a < Xy < (3 for some fixed a{n) ~ and /3(n) ~ 1). Note that we can condition 
on Xy being in that range even if condition A doesn't hold; we will simply no longer 
be guaranteeing that the subset couplings all succeed. As an aside, I condition on 
Xy being in a particular range rather than on the subset coupling succeeding because 
the partition process P(t) is independent of the update variables Aj^, but it is not 
independent of the event that a given subset coupling succeeds. 
I will write F{tk) for the a-algebra described above. The reason for using this a- 
algebra is that, conditioned on P{tk), the part of the partition p{tk) to be split int 
two at time tk, and the size of S(tk, 1), all subsets of p of size \S(tk, 1)| are equally 
likely. Thus, we find that if condition A holds at time tk. 



E[{XtMtk)]-YtMtkmF{tk)]=E[{ J2 (Yt.Ml] - Xt,-i[ 

ie5(tfe,i) 

= E[ Yl iyt,-i[i] - x.^^^m + E[ Y. (Y.^^m - xt,-iMY,,^iH - x.^^^h)] 
\s{tk,i)\ _ 2 \sitk,m\sitk,i )\-i) 

\p\ " ''-' *'="'"''^ bl(bl-i; 



|2 






Where in the third fine I use the fact that "^g^^gu uusa 2)("^tfc-i['^] ~ ^tfc-if-^]) — 
by the assumption that all subset couplings up to this time have succeeded. This 
implies that 



|2 iprf ^1 ^ n 4-M^^illln l'g(4,i)l -I aaiiy y ii2 
\2,p 1^ {tk)\ < (1 + — ^^ — (1 ^^3^ — JJl|At,-i - n,-i||2,p 
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and also that 

E[\\X,, - Y,,\\l \F[t,)] < (1 + M^(l - ^^^^p^))ll^*.-i - yt.-A\l 

Before the next step, we note that the same argument which shows that the pro- 
portional coupling is contractive on the LP' distance between Xt and Yt shows that 
it is contractive on the LP distance between Xt and Yt restricted to any part p{t) of 
the partition P(t), without any change. Because of this, in the following calculations 
we are free to ignore the proportional coupling steps; following along, it is easy to 
see that they would only improve our estimates. We also note that, for any 'part' 
of a partition, only one entry is ever used to 'link' it when creating the next-coarser 
partition. Let Jj^fc be the indicator function of i being in p{tk), and J{i) C (ti, . . . , t„) 
be the set of marked times where the splitting partition contains index i. Then we 
find that, for any particular 2, and for t^+i > t > t^, 

EliXtl^] - Yt[i]f; A{t - 1)] = E[[E[. . . E[{Xt[i] - FTH)'|F(tfc)] |F(4_2)] • • • \F{h)];A{t - 1)] 

^^[ n ii + ^-^^y,Am\x,-Y,\\i 

seJ{i)n{o,t] '^^ ^^ 

< E[2n||Xo-Fo"^ 

|2 



OII2J 

2n||Xo-Fo||2 



Where the second last line is by technical lemma 2. There is a question as to why we 
can take this calculation over the set where condition A holds. The answer is that at 
marked times we pretend we are running the chain by choosing A^; , Xy according to the 
relation ([2]); this chain, of course, is no longer confined to being in the simplex. This 
preserves the formula used above, and at the end we just set to all contributions from 
parts of the chain which left the simplex, which can only decrease the expectation. 
Continuing, we sum this over all i, to find that 

E[\\Xt-Yt\\l;A{t)]<2n'\\Xo-Yo\\l 

Thus, by Markov's inequality, and taking a union bound over < s < t, we have 

P[(sup \\Xs-Ys\\l >n'\\Xo-Yo\\l)r]A{t)] <2tn^-^ 

0<s<t 

where we implicitly use the fact that P[{X > e) (1 S] = P[lsX > e] for nonnegative 
random variables X and e > 0. 



Lemma 11 (Condition Lemma). Condition A as defined in the smallness lemma 
holds until time T with probability at least 1 — 2n^+^^~'^ + o(l). 
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Proof: From the smallness lemma and the proof of the largeness lemma, we find 
for all s that 

P[Aisy] < P[snp\\Xt - YtWl > n-^'] + o(-) 
t<s n 

< 2sn^-^^-'^ + P[A(s - lY] + o(-) 

n 

< 2n^+2e-d ^ P[A{Oy] + o(l) 
= 2n^+2e-d ^ ^^^^ 

which proves the lemma. We note that this is one of several places that an extra 
factor of n is picked up; this turns out to be fairly unimportant. 

Lemma 12 (Weight Lemma). Assume \\Xq — Yq\\ < n^'^. Then the equality 

(3) w{S,Xt)=w{S,Yt) 

holds for alio <t <T and all S C P{t) with probability at least 1 - n^+^^"'^ - n'^-S-e + 
o(l), for any choice of e > 0. 

Proof: The equality clearly holds at time 0, and it can only become false at a 
marked time, since at unmarked times t, the weights of parts of P{t) cannot change 
in either Xt or 1^. So, assume that the proposition is true for time t < tk- Then note 
that if the subset coupling is successful at time t = t^, w{S(t, l),Xt) = w{S(t, l),Yt) 
by construction. However, by induction, w{S(t — 1, 1) U S(t — 1, 2),Xt-i) = w{S(t — 
1, 1) U S{t - 1, 2), Fi_i), and so w{S{t, 2),Xt) = w{S{t, 2), Yt) as well. Since none of 
the other parts of P{t — 1) change weight, this implies that the proposition remains 
true unless one of the n subset couplings fails. 

If condition A holds, the probability of any coupling being successful is at least 
1 — 7^5. 5-e i^y ^]-^g subset coupling lemma. By the condition lemma, this condition 
holds for all n with probability at least 1 — 725+26-^ _|_ Q^^iy Thus, a union bound 
tells us that the probability of any of the n subset couplings failing is less than 
^5+2e-d _|_ ^6.5-e _|_ q^i"j ^ proviug the lemma. 

Now recall that if the graph Gt is connected and all components K G P{t) satisfy 
w{K,Xt) = w{K,Yt) for < t < T, then at time T the two walks have coupled. 
Assume that we have a burn-in period of 6Cn log(n) and a final coupling period of 
T = Cnlog{n). Then choosing d = 3C and e = C in the burn-in lemma gives us a 
total chance of coupling greater than 1 — 2n~^ + n^~'-'' + n^-^-c _|_ 2n^-^~'-'' . This proves 
our theorem. Note that a different choice of d, e for C small gives us some 'pre-cutoff ' 
at a slightly smaller overall multiple of n log(n) steps. 
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6. Lower Bound 

Since our walk is over a continuous space, coupling cannot have occurred unless 
each coordinate has been chosen. Since only two coordinates are chosen at a time, 
the classical coupon-collector results in m tells us that at time T = |n(log(n) — c), 
dxvi^t, ^) > 1 — exp{—exp{c)) + o(l) as n goes to infinity. 

It is possible to do a little better than this. Assume we start at position (1, 0, . . . , 0). 
We then try to 'collect' the n coordinates, but only count a coordinate as 'collected' 
if it is chosen at the same time as a non-zero coordinate. Next, let Tj be the expected 
time to collect the j'th coupon after collecting the j — I'st. We note that E[ti] = n, 
E[t2] = 0. For larger j, E[t,+,] = -^y Thus, 

n n— 1 ^ 

j=l j=2-'^ •>' 

~ In log(ra) 

And in particular the same concentration argument tells us that the mixing time is 
at least 2n(log(r2) — c(e)). 



7. Closely Related Walks 

It is worth pointing out a small number of cases where the above argument goes 
through with very few changes. The first allows us to go from sampling from the 
uniform distribution to sampling from a large class of distributions on the simplex, 
including symmetric Dirichlet distributions. At each step of the random walk, instead 
of choosing A according to the uniform distribution on [0, 1], choose it according to 
some other distribution with twice differentiable cdf F satisfying F\x\ = 1 — F[l — x] 
for all < X < |. Then the above arguments show that the total mixing time is 

0{nlog{n) YZ2^^) , essentially without modifcation. 

It is also possible to apply this argument to its 'discrete' analogue, in which M indis- 
tinguishable balls are stored in n boxes; these are known as M-compositions of n. The 
analogous Markov chain involves choosing two boxes at every step, and having each 
ball switch with probability 1 2 . The same argument applies to the discrete chain, 
giving a mixing bound of order 0{nlog{n)), but there need to be enough balls for the 
continuous approximation to be good at each step. A straightforward step-through 
of the argument gives a bound of M > n^^'^ above, or M > n^'^ for Aldous' greedy 
coupling. 

There is a follow-up paper being prepared which will discuss a wider variety of related 
walks, requiring larger modifcations. 
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8. Perfect Sampling on the Simplex 

In this section, I discuss how the two-chain couphng described above can be mod- 
ified into a grand couphng, and how to use this fact to create a perfect samphng 
algorithm. First, we recaU the couphng from the past (CFTP) algorithm, described 



in greater detail in 12 . First, choose some large time T, and start a copy of the 
Markov chain X'^rp for each u in the sample space Q. Next, couple all of the chains 
from time — T to time 0. If the chains have coallesced by time 0, the resulting single 
value is distributed according to the stationary distribution of the chain. If not, we 
couple chains started at all points from — 2T to T and keep the evolution from — T 
to 0, then from — 3T to — 2T keeping the evolution from — 2T to 0, and so on until 
coallescence at has occurred. 

For Markov chains on a finite state space, it is easy in theory to construct a grand 
coupling that will eventually coallesce, though bad couplings are very inefficient. In 
practice, even on finite chains, CFTP is only used if the chain has some very special 
properties. The most popular property, introduced in the original paper, is 'mono- 
tonicity'. Briefiy, we introduce a partial order < on Q, and say that a coupling of two 
chains Xt, Yf is monotone if Xq < Yq implies Xf < Yt for all t > 0. It is then easy 
to see that if our grand coupling is monotone, it is sufficient to keep track of chains 
started at maximal and minimal elements of the poset. If they have coupled, all states 
have coupled. Most uses of CFTP rely on monotonicity or its twin, antimonotonicity. 
For Markov chains on infinite state spaces, many grand couplings will never coallesce, 
and of course we can't keep track of all of the starting values on a computer. Some 
chains have a monotonicity property, but such a property isn't obvious for the simplex 
model. Despite this, there is a fairly efficient perfect sampling algorithm that requires 
tracking only n + 1 points (and a little extra overhead each time an epoch of length 
T fails). 

Let X^ be a copy of the Markov chain started at v at time 0, and let Cj be the j'th 
unit standard basis vector. We will do a proportional coupling from time < t < Ti. 
Then we note that there exists a matrix \\[j] such that for any f , X"^ = X]?=i -^tbl'^i 
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. Thus, 



i=l 



i=l 3 = 1 



Vj-Wj, 



< 



^nsupXllJ] 
i=i ^ 

n 



e.U 



=1 j,k 

[I- 



n 

j,k 



^e,' 



Applying the older burn-in inequality to the expectation of \\X^^ — X^''\\i for all 
distinct pairs j, k, then Markov's inequality to the probability that the above norm 
is large, we find that for t > |T?mlog(n), 

and so taking a union bound, and applying the inequality proved just above, 

P[ sup \\X^ -X^\\i> n^™+i^] < n-^^ 

This tells us that after a first step of 0{n log(n)) steps, the L^ distance between any 
two points is extremely small. The second step of the coupling is almost identical to 
the proof given in the first section of this note. In particular, we run X^ '"'"' from 
time Ti to time T, recording all choices of edge and averaging variable. We then form 
the same partition process, and use it to attempt maximal couplings of all variables 
to this special chain. As long as the X/ continue to be a sufficiently small ball as 

measured in L^ metric, and X^ ''"'" remains far from 1, we can simultaneously 
couple all of these chains with high probability. 

It remains to determine what happens if one of the above maximal couplings fails. 
First of all, when continuing that run, we should immediately switch to a propor- 
tional coupling rather than continue attempting maximal couplings; this minimizes 
the amount of computation we will need to do later. Then, simply try again on the 
interval [— 2T, — T]. If the coupling succeeds this time, we will need to determine what 
happens to the special chain starting at {n~^, . . . , n~^) at time — 2T. Fortunately, this 
is not so difficult. We already know what has happened to it up to time —T. All of 
the proportional coupling steps that occur between — T and are easy to construct. 
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due to linearity and the fact that we have recorded what has happened to the vertices 
of the simplex. Successful maximal coupling steps are similarly easy to record, since 
the proof of fast coupling for the simplex gives a linear equation which can be used 
to determine the new coordinates. The only difficulty is what happens during the 
single unsuccessful maximal coupling. Fortunately, the remainder measure after a 
failed maximal coupling is just the sum of at most two uniform measures. When we 
get to this point, simply sample from that mixture independently of everything that 
has happened until that point. 

It should be noted that the above algorithm isn't special to our particular walk 
or target distribution on the simplex. For example, it will also work for the earlier 
small modifications based on changing the distribution of A We also observe that this 
algorithm doesn't need an a priori bound on the mixing time to run. In fact, running 
the algorithm can be used to rigorously check an estimated bound of time T, simply 
by running and checking convergence. This has the advantage of being fairly efficient 
even when it is not easy to draw from the stationary distribution. 

9. Acknowledgements 

The author thanks David Aldous for mentioning the problem, and Persi Diaconis, 
Olena Bormashenko and Daniel Jerison for many helpful conversations. 

References 

[1] David Aldous. Open problems. http://www.stat.berkeley.edu/ al- 

dous/Research/OP/indcx.html. 
[2] David Aldous and Jim Fill. Reversible Markov Chains and Random Walks on Graphs. 1994. 
[3] Bela Bollobas. Random Graphs, volume 73 of Cambridge Studies in Advanced Mathematics. 

Press Syndicate of the University of Cambridge, Cambridge, 2001. 
[4] Olena Bormashenko. A coupling proof for random transpositions. Preprint, 2011. 
[5] Robert Burton and Yevgeniy Kovchegov. Mixing times via super-fast coupling. Preprint, 2011. 
[6] Persi Diaconis. The markov chain monte carlo revolution. Bull. Amer. Math. Soc, 46(1):179- 

205, 2008. 
[7] Paul Erdos and Alfred Renyi. On a classical problem of probability theory. Magyar Tud. Akad. 

Mat. Kutato. Int. Kozl, 1961. 
[8] Tom Hayes and Eric Vigoda. A non-markovian coupling for randomly sampling colorings. FOGS 

proceedings, 2003. 
[9] Yuval Levin, David; Peres and Elizabeth Wilmcr. Markov Chains and Mixing Times. American 

Mathematical Society Providence, Rhode Island, 2009. 
[10] Laslo Lovasz. Hit and run mixes fast. Math. Prog., 1998. 
[11] Laslo Lovasz and Santosh Vempala. Hit and run is fast and fun. Technical Report - Microsoft 

Research, 2003. 
[12] James Propp and David Wilson. Exact sampling with coupled markov chains and applications 

to statistical mechanics. Rand. Struct. Alg., 9:223-252, 1996. 
[13] Dana Randall and Peter Winkler. Mixing points on a circle. Lecture Notes in Computer Science, 

3624:426-435, 2005. 



16 AARON SMITH 

[14] Dana Randall and Peter Winkler. Mixing points on an interval. Proceedings of ANALCO, 2005. 
[15] Jeffrey Rosenthal. Minorization eonditions and convergence rates for niarkov chain monte carlo. 

JASA, 90:558-566, 1995. 
[16] Wai Kong Yuen. Applications of geometric bounds to convergence rates of of niarkov chains 

and markov processes on rn. PhD Thesis, 2001. 

Department of Mathematics, Stanford University, Stanford, CA 94305 
E-mail address: asmith3@math.stanford.edu 



